
Abstract¶

Singapore has emerged as one of the world's most prosperous countries. In addition to being a financial center, it is an achievement in urban planning and serves as a model for developing nations.

Public housing in Singapore is subsidized, built, and managed by the Government of Singapore, and the country has one of the world's highest home ownership rates. More than 80% of the 5.8M population live in Housing and Development Board (HDB) apartments, commonly known as HDB "flats", and more than 90% of those residents own the flat they live in. The government subsidizes the cost of new homes, and buyers can obtain loans from the Housing and Development Board, along with a 10% down payment. Singapore's housing estates are considered mixed-income developments.

With more than 1 million flats spread across 24 towns and 3 estates, Singapore's public housing is unique. As a geographic reference, Singapore is slightly more than 3.5 times the size of Washington, D.C., yet houses a population of roughly 5.8 million.

Background - In 1960 the Singapore Housing and Development Board (HDB) was formed to provide affordable, high-quality housing for residents of this city-state nation. Housing is issued by the state on 99-year leaseholds, and a home's value generally depends on many factors: the inherent utility value of the property, size/square footage, flat type and model, age, location and geographical proximity, etc.

We will explore and examine various factors (including geospatial features) that can be used to accurately predict flat resale values.

Our goal is to identify the true drivers of HDB flat resale prices, and to create an interactive system to predict these prices.

Example of an HDB flat:


Data Used¶

Primary Data:¶

The primary dataset utilized for the project was Singapore's HDB Resale Flat Prices (resale transacted prices), which is published by the Singapore Housing and Development Board (HDB) and updated on a weekly basis.

  • Source: Data can be found here (https://data.gov.sg/dataset/resale-flat-prices)
  • Range: The dataset extends from January 1990 through the present day.
  • The baseline dataset consisted of five core data files, covering five time-series ranges: 1990-1999, 2000-2012, 2012-2014, 2015-2016, and 2017-present. These files were merged into one consolidated master dataset and imported for analysis.
  • Features from this consolidated set include: month (the month/year of the resale transaction), town (the Singapore town of the flat), block (the address component of the flat), street name (the physical street address), flat type (the category of flat, including room count), floor area (the flat's surface floor area in square meters), flat model (the category of flat by model), lease commence date (the year the flat's lease commenced), storey range (the range of the flat's storey/floor level, e.g. '04 to 06' identifying a flat between the fourth and sixth storey), remaining lease (the remaining years and months of the 99-year lease), and resale price (the price sold for in Singapore dollars, and the major feature of this dataset we will be trying to predict). A data row example can be found here, and high-level column descriptions are included in this table.
  • Remaining lease: the number of years left before the lease ends; this information is computed as at the resale flat application.
  • Resale Price: these should be taken as indicative only, as resale prices are agreed between buyers and sellers and depend on many factors, which we will explore.
  • Remember, lease left refers to the number of years to the expiry of the 99-year lease, after which ownership of the flat returns to the government. This is a very different concept from home ownership in the United States.

  • Total transaction observations: 867,677 rows (with y features), covering 11,747 days.

Note: A helpful map to get a feel for Singapore in general is provided here


A dataset example below:

dataset_example.png


Secondary Data¶

Additional sources of data included:

  • HDB Resale Price Index (https://data.gov.sg/dataset/hdb-resale-price-index): this allowed scaling and normalization of the resale price values

Plotted resale price index:

resale_price_index.jpg

Merged Spatial Data¶

  • This is kept here: insert

Ethical Considerations:¶

Our core data is governed by the Singapore Open Data License (https://data.gov.sg/open-data-licence), which aims to "promote and enable easy reuse of Public Sector data to create value for the Singapore area community and businesses".

According to the licence terms, we are allowed to use, access, download, copy, distribute, transmit, modify, and adapt the datasets, or any derived analyses or applications. We followed these terms explicitly: we are not allowed to use the datasets in a way that suggests any official status, or that a Singapore agency endorses us or our use of their datasets. We specifically followed their guidance that any application/website using the data display a conspicuous notice acknowledging the source of the datasets and including a link to the most recent version of their posted licence.

Location Data:

  • None of the information included is proprietary; it is all public


EDA¶

After importing the data, we performed a conventional deep-dive Exploratory Data Analysis (EDA) in order to get a feel for the dataset. Plots were created to show the distribution of initial features and the value-count breakdown per feature.

Summary statistics were investigated and plotted as well.

We also tried to identify anomalies that could potentially be removed.

Average price per square meter versus various categories yielded remarkable insight.

EDA Notebooks:¶

  • EDA part I: github notebook html
  • EDA part II: github notebook html
  • Feature correlation matrix (triangular form)

correlation_matrix_baseline_triangular.png


correlation_with_price_per-sqm_normed.png


Observations:¶

  • A high amount of correlation between features; VIF was used for analysis
  • The top 10 storey_ranges encompassed the vast majority of flats. In addition, there appeared to be a moderate correlation between resale price and height (storey range)
  • HDB flat floor area ranged from 28-307 square meters, in a trimodal distribution
  • The normalized resale price histogram was right-skewed
  • Correlation - one of the strongest correlations was seen between normalized resale price and floor area in square meters
  • Certain towns appear to command higher values; we can attribute that to location, for instance within the Central Region (which is closest to the city center)


Feature Engineering¶

Initial Dataset - Cleaning and Feature Creation:¶

  • Performed initial cleaning on the dataset to consolidate repeated flat_type values
  • Merged together the block and the street_name to create an address
  • Created new storey_range_min and storey_range_max features, split from the storey_range text input ('04 to 06' became 4 and 6)
  • nrooms (number of rooms) was created via the flat_type (encompassing flats with 1 to 5 rooms)
  • floor_area_sqm
  • One-hot encoded features such as abc and def
  • remaining_lease_years was created mathematically from the original lease_commence_date and the 99-year lease term
  • remaining_lease_years was also derived from remaining_lease, dropping remaining_lease_months
  • It was necessary to normalize the resale price values using the HDB Resale Price Index, which can be found here; this currently covers the time frame from January 1990 to September 2021. This data allowed us to normalize our resale prices to comparable values. Thus resale_price_norm was created from resale_price (converted via the index).
  • An attempt was made to remove outliers

  • Multiple original-price features were created mathematically: price_per_sq_ft, price_per_sq_m, price_per_sq_ft_per_lease_yr, and price_per_sq_m_per_lease_yr.

  • Multiple normalized-price features were created mathematically: price_per_sq_ft_norm, price_per_sq_ft_per_lease_yr_norm, price_per_sq_m_norm, and price_per_sq_m_per_lease_yr_norm.
  • End result was the following features, with the following as an example observation.
  • Singapore towns were mapped to pertinent regions (for instance, Central Region)
  • Data was periodically exported in Parquet format for speed (rather than pickle)
  • flat_model was later identified as a candidate for removal (20 unique values), as it largely overlapped with what other features captured
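The storey_range split and the index normalization described above can be sketched in pandas; the column names match the dataset, but the values below (including the index figures) are illustrative stand-ins, not the real HDB Resale Price Index:

```python
import pandas as pd

# Toy resale rows (values illustrative, not real transactions)
df = pd.DataFrame({
    "month": ["1990-01", "2021-09"],
    "storey_range": ["04 TO 06", "10 TO 12"],
    "resale_price": [50000.0, 500000.0],
})

# Split '04 TO 06' into numeric min/max storey features
parts = df["storey_range"].str.split(" TO ", expand=True)
df["storey_range_min"] = parts[0].astype(int)
df["storey_range_max"] = parts[1].astype(int)

# HDB Resale Price Index lookup table (illustrative values), base = 100
rpi = pd.DataFrame({"month": ["1990-01", "2021-09"], "index": [24.3, 150.6]})

# Normalize each transaction's price by the index for its month
df = df.merge(rpi, on="month", how="left")
df["resale_price_norm"] = df["resale_price"] / df["index"] * 100.0
```

The merge-then-divide step is what makes a 1990 price comparable to a 2021 price in resale_price_norm.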


Cleaned features - base dataset - example observation:

features_initial_normed.bmp


Merging Geo-spatial features for more data¶

A large set of Singapore geospatial location features was consolidated and pushed to the database:

  • Market / Food Centers
  • Road information
  • Fire stations, police stations, healthcare/hospitals
  • MRT train station data (including number of lines, exit locations, etc.)
  • Bus stops, taxi stands
  • Schools (pre-schools, primary schools, secondary schools, high schools, etc.), along with their ranked scores
  • Conservation area information

It was then possible to calculate the distances to nodes via the 'geometry' feature.

Python code was used to query the OneMap API.

Coordinates are in WGS 84 (EPSG:4326): https://epsg.io/4326
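Because the coordinates are WGS 84 (EPSG:4326) latitude/longitude pairs, a straight-line flat-to-amenity distance can be approximated with the haversine formula; a minimal pure-Python sketch (the two points below are illustrative locations in Singapore, not actual dataset rows):

```python
import math

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two WGS 84 points."""
    r = 6_371_000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Illustrative: a flat versus a nearby MRT station, both near Toa Payoh
d = haversine_m(1.3324, 103.8472, 1.3327, 103.8535)  # roughly 700 m
```

Libraries such as GeoPandas/PostGIS do this at scale; the formula is shown only to make the geometry concrete.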


Machine Learning Modeling Approach¶

Initial Dataset - The initial dataset (prior to embedding location features) was split chronologically and trained with Linear Regression, XGBoost, and Random Forest Regression models.

Evaluation - In order to evaluate our model, we needed metrics to tell us how accurate our predictions were, and what was the amount of deviation from the actual values. In order to determine how well the model fit the data, we used $R^2$ (the proportion of variance explained), Root Mean Squared Error (RMSE), Mean Squared Error (MSE), and Mean Absolute Error (MAE).
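These four metrics follow directly from their definitions, so a small NumPy sketch can make them concrete (the toy values are illustrative):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE, and R^2 computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    mse = np.mean(resid ** 2)                       # Mean Squared Error
    rmse = np.sqrt(mse)                             # Root Mean Squared Error
    mae = np.mean(np.abs(resid))                    # Mean Absolute Error
    ss_res = np.sum(resid ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                      # proportion of variance explained
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}

m = regression_metrics([3.0, 5.0, 7.0], [2.5, 5.0, 8.0])
```

In practice the equivalent scikit-learn functions were available; the explicit formulas show what each score measures.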

The validation dataset was used to initially investigate performance and to tune hyperparameters, while the test dataset was used to evaluate the final model's performance.

After extensive tuning, the Random Forest Regression model on the initial dataset was able to obtain a training R-square value of insrt, a validation R-square value of insrt, and a final test R-square value of insrt.

For XGBoost, objective='reg:squarederror' means that, since we are faced with a regression problem, the objective is to minimize the squared error. As always, the goal is to MINIMIZE the error, so the lower the MSE and RMSE, the better.
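What reg:squarederror supplies each boosting round is the gradient and hessian of the squared-error loss; a pure-Python sketch of those quantities (an illustration of the math, not XGBoost's actual implementation):

```python
def squared_error_objective(y_pred, y_true):
    """Per-example gradient and hessian of 0.5 * (y_pred - y_true)**2,
    the quantities a squared-error objective feeds each boosting round."""
    grad = [p - t for p, t in zip(y_pred, y_true)]  # first derivative w.r.t. prediction
    hess = [1.0 for _ in y_pred]                    # second derivative is constant
    return grad, hess

grad, hess = squared_error_objective([3.0, 1.0], [2.0, 2.0])
```

Each new tree is fit against these gradients, which is why driving MSE down is the same thing as following this objective.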

Final Dataset (including location features) - After the location features were embedded, the final dataset was trained with XGBoost and Random Forest Regression models.

Interpretation - the ability to interpret our model's outputs and feature importances was important to us. A conventional Linear Regression model has standard coefficient outputs that are easy to understand, but due to the complexity of our data, our models needed more explainability given the lack of a conventional coefficient.

Model feature importance outputs were possible in Random Forest Regression (example here).

SHAP (SHapley Additive exPlanations) is a graphical/numeric approach to explain the output of any machine learning model, and it is one of the items we chose to use to help explain our model outputs. An example of this output is provided here.
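Under the hood, SHAP attributions are Shapley values from cooperative game theory; a minimal pure-Python sketch computes them exactly for a hypothetical 3-feature model (real usage goes through the shap library's optimized explainers):

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over all feature orderings. Features not yet 'present' keep their
    baseline value."""
    n = len(x)
    phi = [0.0] * n
    perms = list(permutations(range(n)))
    for order in perms:
        z = list(baseline)
        prev = model(z)
        for i in order:
            z[i] = x[i]          # add feature i to the coalition
            cur = model(z)
            phi[i] += cur - prev  # its marginal contribution in this ordering
            prev = cur
    return [v / len(perms) for v in phi]

# Hypothetical linear "model": Shapley values recover each term's contribution
model = lambda z: 2.0 * z[0] + 3.0 * z[1] - z[2]
phi = shapley_values(model, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
```

The values sum to model(x) minus model(baseline) (the "efficiency" property), which is what makes SHAP outputs additive explanations.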

Overall Observations:

  • Overfitting is very common at the beginning and is critical to identify (if the performance of the model on the training dataset is significantly better than the performance on the test dataset, then the model has most likely overfit the training set). Tuning all models was necessary due to the large number of features we utilized.
  • Random Forest is an ensemble (bagging) technique that trains a number of decision trees on various subsets of the data and averages their predictions to improve accuracy. Instead of relying on one decision tree, the random forest aggregates the prediction from each tree to produce the final output. A greater number of trees generally leads to higher accuracy and helps guard against overfitting. Although random forest regression can achieve high accuracy, appears robust against outliers, and works well with non-linear data, there is a cost: the models take a fair amount of time to train, and they are not always a great fit for linear problems with many sparse features.
  • XGBoost is hard to tune due to its fair number of hyperparameters; even its roughly six key ones (e.g. learning rate, subsample, and min child weight) mean that performing grid search, for instance, takes time. Like Random Forest, it is decision-tree based. The more flexible and powerful an algorithm is, the more design decisions and adjustable hyperparameters it will have, and XGBoost has many.
  • One thing we could have done better was to spread the training over multiple compute nodes, as it was a time-consuming task; an exhaustive grid search can take a long time.


Overall Observations - Base:

Base Model (not including location features):

feature_importances_random_forest_regressor.png


prediction_error_random_forest_regressor_baseline.png


residuals_random_forest_regressor_baseline.png


Overall Model Results¶

Feature Importances - The most important features appeared to be:

Initial Dataset - Feature importances for our best baseline random forest regressor are found here, with the following key variables:

# Final Dataset (including location features):  Current Model

# --- R2 Scores ---
Train:       0.962 
Validation:  0.897
Test:        0.795

#  --- MSE --- 
Train:       11.153
Validation:  40.907 
Test:        71.863



# New Model Output Results:

--- Test Set ---
Mean Absolute Error: ... 6.078747834396562
Mean Squared Error:..... 67.18
RMSE: .................. 8.196381335337657
Coeff of det (R^2):..... 0.809 (1.4 % better)   

--- Val Set ---
Mean Absolute Error: ... 4.498476271962494
Mean Squared Error:..... 38.02  
RMSE: .................. 6.166026942280251
Coeff of det (R^2):..... 0.904 (0.7 % better)   

--- Train Set ---
Mean Absolute Error: ... 2.339212417385334
Mean Squared Error:..... 10.24
RMSE: .................. 3.200258043072929
Coeff of det (R^2):..... 0.965


# New Model Hyperparameters (XGBoost-based)
    max_depth=7
    min_child_weight=6
    gamma = 10
    subsample=0.75
    colsample_bytree = 0.5
    reg_alpha = 100
    reg_lambda = 1
    n_estimators=800  (can add more if desired)
    learning_rate=0.16
    seed=42
    tree_method='hist'


hyperparameters that seemed to matter:

image.png


SHAP final values (re-insert):

shap_cut.png


Limitations¶

It should be understood that market forces were at play over our data's time range, so predictions of resale prices will never be perfect.


The Application / Front-End¶

Database:¶

We chose to host the project's raw data in a database on Amazon. Amazon Relational Database Service (RDS) is a collection of managed cloud services that enabled us to set up, operate, and scale our database instance in the cloud.

Our choice for the database version was PostgreSQL, a powerful open source object-relational database system with many years of active development and a strong reputation for reliability, robustness, and performance.

Our code interacted with the database via SQLAlchemy (a Python SQL toolkit and Object Relational Mapper library). This allowed an engine connection to upload, store, manipulate, merge, and join various database tables. The PostgreSQL dialect uses psycopg2 as the default Python DBAPI. We were also able to connect to the database remotely via the Postgres tool pgAdmin, which made it easy to view changes.
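That engine workflow can be sketched as follows; the in-memory SQLite URL is a stand-in so the snippet is self-contained, while the actual project would point a postgresql+psycopg2:// URL at the RDS instance:

```python
import pandas as pd
from sqlalchemy import create_engine

# Stand-in engine for illustration; for RDS Postgres, something like:
# create_engine("postgresql+psycopg2://user:password@<rds-host>:5432/<db>")
engine = create_engine("sqlite://")

# Upload a table through the engine, then query it back
flats = pd.DataFrame({"town": ["ANG MO KIO"], "resale_price": [410000.0]})
flats.to_sql("resale_flats", engine, index=False, if_exists="replace")

out = pd.read_sql("SELECT town, resale_price FROM resale_flats", engine)
```

The same to_sql / read_sql pattern works unchanged against the Postgres dialect, which is the appeal of routing everything through one engine.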

We enabled PostGIS (an extension in PostgreSQL for storing and managing spatial information) in AWS to consolidate location features. Spatial tables were set up in the database.
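With PostGIS enabled, nearest-amenity distances can be computed in SQL through the same SQLAlchemy engine; the query below is a hypothetical sketch (table and column names are illustrative, and executing it requires the live PostGIS database, so only the query construction is shown):

```python
from sqlalchemy import text

# Hypothetical schema: flats(addr, geom) and mrt_stations(name, geom),
# both stored as WGS 84 (SRID 4326) points. Casting to geography makes
# ST_Distance return meters; <-> gives index-assisted nearest-neighbor ordering.
nearest_mrt = text("""
    SELECT f.addr,
           m.name,
           ST_Distance(f.geom::geography, m.geom::geography) AS dist_m
    FROM flats AS f
    CROSS JOIN LATERAL (
        SELECT name, geom
        FROM mrt_stations
        ORDER BY f.geom <-> geom
        LIMIT 1
    ) AS m
""")

# Against a live engine: rows = connection.execute(nearest_mrt).fetchall()
```

Pushing the distance work into PostGIS keeps the Python side to a single query per feature table.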


Database Contents:¶

Below shows the vast number of database tables created:

huge_data_view.svg


Applying the Algorithm:¶

Database: insert a bunch of info here

Pipeline:¶

All data is housed in the Amazon RDS Postgres database, and updates to database tables are pushed periodically.


Further Investigation¶

Some additional areas we plan on researching and investigating:

Modified Distance - Potentially adding a modification of the straight-line geospatial distance features in 'Manhattan' form (i.e. taxicab / city-block geometry). Singapore is a very walkable city, and travel from an HDB flat to nearby feature locations (such as hospitals) is often only possible via sidewalks or city streets. We have also considered adding, potentially in the future, a feature for the total travel time (whether walking or driving) from the flat to the specific destination location.
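On projected, meter-based coordinates the taxicab variant simply swaps the norm; a small sketch contrasting the two (the coordinates are illustrative offsets in meters, not real locations):

```python
import math

def euclidean_m(x1: float, y1: float, x2: float, y2: float) -> float:
    """Straight-line distance on projected (meter-based) coordinates."""
    return math.hypot(x2 - x1, y2 - y1)

def manhattan_m(x1: float, y1: float, x2: float, y2: float) -> float:
    """Taxicab / city-block distance: sum of the axis-aligned legs."""
    return abs(x2 - x1) + abs(y2 - y1)

# Illustrative flat and hospital 300 m east and 400 m north apart
d_line = euclidean_m(0.0, 0.0, 300.0, 400.0)
d_walk = manhattan_m(0.0, 0.0, 300.0, 400.0)
```

The Manhattan distance is always at least the straight-line distance, which is the intuition behind it better reflecting sidewalk travel.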

Crime Data - Overlaying another set of features associated with historical crime (similar to New York City's statistics) is an option. Although crime is extremely low in Singapore, it may be interesting to see whether it is a factor in resale pricing values.

PLH (Prime Location Housing) - Scenarios where no resale HDB flats exist yet (purchases come directly from HDB, not through resale transactions). PLH is a recently launched housing scheme that places more restrictions on resale. The concept is that the machine learning model could estimate what a resale might potentially fetch, allowing a mapping from the HDB purchase price to an estimated appreciation. Under this new model for public flats in prime areas, owners of BTO flats face a 10-year minimum occupation period; the flats are priced with additional subsidies; those who sell their BTO units must pay HDB back a percentage of the resale price; and the resale buyer criteria for these units are tighter than for typical resale units.

Unsupervised Clustering - Deeper dive into clustering

Interactivity - Plan to investigate adding more features to the front-end application, including A and B and C


Statement of Work¶

Work Breakout was the following:

Michael - insert

Stuart - insert

Tom - insert


Appendix¶

An ArcGIS view, allowing layers to be toggled as needed for familiarity with the area: LINK:

Click layers on the left-hand side for filtering.

image.png


data_chart.svg


example of initial dataset observation:

example_dataset.svg